Predicting the 2023 NBA Champion using Data Science

Collaborators: Ayman Chowdhury, Keshav Ganapathy, Julian Javillo, Sahit Kadthala

Introduction

The analysis of this project was formulated with the intention of exploring the key steps involved in data science. Towards this end, we have taken it upon ourselves to investigate how an NBA team's common regular season metrics such as its average points per game, avergae rebounds per game, average field goal percentage, and so forth over the course of the year influence its probability at winning a championship.

For more context, the NBA stands for the National Basketball Association and is a basketball league that has featured the premier basketball players in the world since its inception roughly 75 years ago. Every single season takes place from around the start of November to the start of June and includes a tournament at the end called the playoffs whose winner is deemed the "champion." Each one can be split into two chunks -- one chunk prior to the playoff tournament (called the regular season) and the other being the playoff tournament itself. Roughly the top half of NBA teams in terms of win-to-loss ratio (also known as win-loss record) make it to the playoffs, with each team paired with another and the winner of the match-up moving on to the next stage. As such, we wish to see how teams' regular season metrics can be used to predict who the nba champion will end up being. Read this article if interested in more details of how the playoffs work.

The reason we focus on the NBA is multi-fold. For one, due to the nature of how basketball is played and the meticulousness of the NBA's statkeeping, there is plenty of data that exists with regard to NBA teams' regular season performance. This means that there is plenty of data that we can use to analyze and that we can utilize to showcase the essence of data science. A quick note that should be made is that other metrics exist besides a team's points per game, rebounds per game, and so forth in the form of "advanced stats," but we want to focus on more general ones since those tend to be the ones looked upon by the common fan. Another reason is the sheer relevance and importance of the NBA and its playoffs as well as what it means for millions. Right now, the NBA playoffs are taking place, so the topic is fresh in the minds of many. Just to get an idea of how much of a cultural phenomenon the NBA and its playoffs are, according to Front Office Sports, the NBA made over 10 billion dollars in revenue (see this article for more details) and is commonly referenced in many popular shows and music songs of the day. The NBA brings a myriad of people together, hoping for their teams to get their hands on the grand prize of a championship, which gives a sense of how meaningful the topic of who will end up winning will be.

With that being said, we shall commence with our tutorial of the data science process. The steps go as follows:

1.) Data Collection
2.) Data Processing
3.) Exploratory Analysis and Visualization
4.) Machine Learning Modeling
5.) Insight and Conclusions

Note: There are embedded links throughout to provide more information on any given topics.

Data Collection

This is the first step of the data science lifecycle where we will be looking for and retrieving datasets related to the NBA. It is important to find a source of data that is reputable and for this topic, we were able to find a source called Basketball Reference which is a well known statistics database for all players and teams in the NBA. We opted for the concept of web scraping to retrieve the data because our intention is to collect data from the last 41 years and the alternative to web scraping would be to download and import 40 CSV files which would be more of a hassle, especially for a new data scientist. To be able to predict the next NBA Champion, we will aim to use the 41 years worth of statistics of the successful teams in the past to determine the characteristics that define a winning team.


We will develop this project using the Python language in a Jupyter Notebook which you can learn more about at Dataquest. Below are the import statements for the Python libraries that we will need to use to develop this project along with a headers declaration to provide credentials for our HTML requests. Pandas is a library and data analysis tool that we will use throughout this project to store and manipulate our data.

Imports

In [1]:
#python libraries used for HTML requests
import requests
from bs4 import BeautifulSoup

#standard python libraries
import pandas as pd
import time
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#ml libraries
from sklearn import tree

#warnings filter for converting a HTML table to Pandas df
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#used in HTML requests
headers = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405',
           'From': 'pleaseletmein@gmail.com'
}

Data Storing

We first create a dataframe that will store all per game statistics for each team during each year. We abbreviate the column names to save space in the dataframe, so here are what each of the abbreviations translate to.

G = Games, W = Wins, L = Losses, MP = Minutes Played, FG = Field Goal, FGA = Field Goal Attempts, 3P = 3 Pointers Made, 3PA = 3 Pointers Attempted, 3P% = 3 Point Percentage, 2P = 2 Pointers Made, 2PA = 2 Pointers Attempted, 2P% = 2 Point Percentage, FT = Free Throws Made, FTA = Free Throws Attempted, FT% = Free Throw Percentage, ORB = Offensive Rebounds, DRB = Defensive Rebounds, TRB = Total Rebounds, AST = Assists, STL = Steals, BLK = Blocks, TOV = Turnovers, PF = Personal Fouls, PTS = Points, OR = Offensive Rebounds, DR = Defensive Rebounds

In [2]:
#Set the columns of the pandas dataframe
df_per_game = pd.DataFrame(columns=["Team", "G", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "FT",
                          "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS", "Year", "WINNING_STATUS"])

df_advanced = pd.DataFrame(columns=["Team","Year","W", "L", "OR", "DR"])

We then create Basketball Reference URLs for each NBA season from 1980-1981 to 2021-2022 and scrape the per-game stats from each year into our dataframe. We note that if a team name has a * at the end, we mark them as a playoff team in our dataframe by setting their Winning Status to 1. The team that won the NBA Finals each season is marked with a Winning Status of 2 to differentiate the champions' statistics from the rest of the league.

As a general reference, it is very important to refer to databases' specific API restrictions to avoid issues regarding timeouts and blackouts from certain websites. Many websites also have their own specific formatting, which may change how data is scraped from website to website. Some websites may completely ban web scraping in which case we must respect their rules and avoid scraping their data.

The Basketball Reference website allows web scraping, but it restricts us to a maximum of 20 requests per minute to avoid overloading their servers. As a result, we introduce a 5 second time delay when retrieving our 41 seasons worth of data so that we don't exceed the limit. We will be scraping two major tables containing all of the statistics which we need and the tables are called Per Game Statistics and Advanced Stats and these two tables exist for every season we want to analyze. The process of webscraping can be found below with comments describing the key details.

NOTE: This will take a few minutes to run so try run it as few times as possible

In [3]:
#for loop to repeat process for every season between 1980 season and 2022 season
for year in range(1981, 2023):
    #Gets URL for every requested year and retreives HTML
    yearlyURL = 'https://www.basketball-reference.com/leagues/NBA_{}.html'.format(year)
    x = requests.get(yearlyURL, headers=headers)
    soup = BeautifulSoup(x.text)
    table = soup.find('table', id ='per_game-team')
    table_advanced = soup.find('table', id ='advanced-team')
    
    #scrape advanced stats for wins and losses
    for row in table_advanced.tbody.findAll('tr'):
        columns = row.findAll('td')
        teamName = columns[0].text.strip()
        if (columns[0].text.strip()[-1] == "*"):
            teamName = teamName[:-1]

        columns = row.findAll('td')
        new_row = pd.Series({"Team":teamName, "Year": year, "W": float(columns[2].text.strip()), "L": float(columns[3].text.strip()), 
                             "OR": float(columns[9].text.strip()),"DR": float(columns[10].text.strip())})
        df_advanced = df_advanced.append(new_row, ignore_index=True)
    
    #scrap per game statistics for remaining statistics
    for row in table.tbody.findAll('tr'):
        columns = row.findAll('td')
        
        #process whether the team made the playoffs/won the finals
        winningStatus = 0
        teamName = columns[0].text.strip()
        if (columns[0].text.strip()[-1] == "*"): 
            winningStatus = 1
            teamName = teamName[:-1]
        
        #develop new row for each team in the current season and append to Pandas df
        new_row = pd.Series({"Team": teamName, "G": float(columns[1].text.strip()), "MP": float(columns[2].text.strip()), "FG": float(columns[3].text.strip()), 
                    "FGA": float(columns[4].text.strip()), "FG%": float(columns[5].text.strip()), "3P": float(columns[6].text.strip()), "3PA": float(columns[7].text.strip()),
                   "3P%": float(columns[8].text.strip()), "2P": float(columns[9].text.strip()), "2PA": float(columns[10].text.strip()), "2P%": float(columns[11].text.strip()),
                   "FT": float(columns[12].text.strip()), "FTA": float(columns[13].text.strip()), "FT%": float(columns[14].text.strip()), "ORB": float(columns[15].text.strip()),
                   "DRB": float(columns[16].text.strip()), "TRB": float(columns[17].text.strip()), "AST": float(columns[18].text.strip()), "STL": float(columns[19].text.strip()),
                   "BLK": float(columns[20].text.strip()), "TOV": float(columns[21].text.strip()), "PF": float(columns[22].text.strip()), "PTS": float(columns[23].text.strip()),
                    "Year": year, "WINNING_STATUS": winningStatus})
        df_per_game = df_per_game.append(new_row, ignore_index=True)

    df_per_game.loc[(df_per_game["Team"] == soup.findAll("p")[2].find("a").text.strip()) & (df_per_game["Year"] == year), "WINNING_STATUS"] = 2

    #sleep delay to prevent overloading the Basketball Reference servers with web scraping
    time.sleep(5)

We are now left with a Pandas dataframe with each team's per game statistics from the 1980-1981 season to the 2021-2022 season and another dataframe containing each team's Wins and Losses for each season. We can analyze these dataframes with Pandas functions, along with functions from other libraries, in order to explore the data and find any interesting trends.

Data Processing

Now we will take our two dataframes containing all of the statistics we need and merge them using pd.concat() into one final dataframe that we will use from now on for our analysis. We can now move on to calculating the winning percentage for each observation. The reason we are doing this is that the number of games across NBA history has not been exactly the same. Although there have generally been 82 games in an NBA season since the 1980 season, there were a couple times where there was a shortened nba season -- called a lockout season -- shortened due to a labor dispute between the players and the NBA. As such, winning percentage allows a more fair comparison across the seasons.

In [4]:
df = pd.merge(df_per_game, df_advanced, on=['Team', 'Year'])
df.Year = df.Year.apply(pd.to_numeric)
df.W = df.W.apply(pd.to_numeric)
df.L = df.L.apply(pd.to_numeric)
df['W%'] = round(df['W']/df['G'], 2)

The next step is to account for the fact that some teams changed locations over the course of the 41 year dataset. To account for this, we will substitute old team names with modern team names using Pandas functions to reduce the number of individual teams that we are working with.

In [5]:
df.loc[df['Team']=="Charlotte Bobcats", 'Team'] = "Charlotte Hornets"
df.loc[df['Team']=="Kansas City Kings", 'Team'] = "Sacramento Kings"
df.loc[df['Team']=="New Jersey Nets", 'Team'] = "Brooklyn Nets"
df.loc[df['Team']=="New Orleans Hornets", 'Team'] = "New Orleans Pelicans"
df.loc[df['Team']=="New Orleans/Oklahoma City Hornets", 'Team'] = "New Orleans Pelicans"
df.loc[df['Team']=="San Diego Clippers", 'Team'] = "Los Angeles Clippers"
df.loc[df['Team']=="Seattle SuperSonics", 'Team'] = "Oklahoma City Thunder"
df.loc[df['Team']=="Vancouver Grizzlies", 'Team'] = "Memphis Grizzlies"
df.loc[df['Team']=="Washington Bullets", 'Team'] = "Washington Wizards"

We then reformat the columns of the dataframe to make the data easier to read and easier to parse in our visualization

In [6]:
df = df[["Team", "G", "Year","WINNING_STATUS", "W", "L", "W%", "OR", "DR", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "FT",
                          "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS"]]
df
Out[6]:
Team G Year WINNING_STATUS W L W% OR DR MP ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 Denver Nuggets 82.0 1981 0 37.0 45.0 0.45 109.4 109.8 243.4 ... 0.783 16.2 30.5 46.6 24.8 8.8 4.6 17.6 25.7 121.8
1 Milwaukee Bucks 82.0 1981 1 60.0 22.0 0.73 108.7 101.8 240.9 ... 0.770 15.4 29.4 44.7 28.3 10.5 6.5 19.3 26.8 113.1
2 San Antonio Spurs 82.0 1981 1 52.0 30.0 0.63 108.5 105.7 241.5 ... 0.769 15.9 31.5 47.4 25.0 8.4 7.8 18.7 25.8 112.3
3 Philadelphia 76ers 82.0 1981 1 62.0 20.0 0.76 107.0 99.5 242.1 ... 0.768 13.3 31.9 45.2 28.9 10.5 7.2 20.8 25.1 111.7
4 Los Angeles Lakers 82.0 1981 1 54.0 28.0 0.66 107.6 103.9 241.5 ... 0.729 14.2 30.4 44.6 28.8 9.9 6.7 19.0 23.8 111.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1167 New York Knicks 82.0 2022 0 37.0 45.0 0.45 110.4 110.5 241.2 ... 0.744 11.5 34.6 46.1 21.9 7.0 4.9 13.3 20.4 106.5
1168 Portland Trail Blazers 82.0 2022 0 27.0 55.0 0.33 107.8 116.9 240.6 ... 0.760 10.4 32.5 42.9 22.9 8.0 4.5 14.5 21.1 106.2
1169 Detroit Pistons 82.0 2022 0 23.0 59.0 0.28 106.0 113.8 241.2 ... 0.782 11.0 32.0 43.0 23.5 7.7 4.8 14.2 21.9 104.8
1170 Orlando Magic 82.0 2022 0 22.0 60.0 0.27 104.5 112.5 241.2 ... 0.787 9.1 35.2 44.3 23.7 6.8 4.5 14.5 19.7 104.2
1171 Oklahoma City Thunder 82.0 2022 0 24.0 58.0 0.29 104.6 112.8 241.5 ... 0.756 10.4 35.2 45.6 22.2 7.6 4.6 14.0 18.3 103.7

1172 rows × 31 columns

Now that we have cleaned all of our required data, we can store it in a local CSV file. This allows us to use the data locally from now on and not have to ever re-run the web scraping cell to retrieve the original dataset.

In [7]:
#write pandas dataframe to a local csv file that we create
df.to_csv('NBATeamStatistics.csv', index=False)

Note: This cell below is the first cell you will have to run in the future to retrieve the original dataset and save it as df. The read_csv() function reads all of our web scraped data from a local CSV file, so you no longer need to web scrape for the data, saving time in the development process of this project.

In [8]:
#read data from local csv
df = pd.read_csv('NBATeamStatistics.csv')
df
Out[8]:
Team G Year WINNING_STATUS W L W% OR DR MP ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 Denver Nuggets 82.0 1981 0 37.0 45.0 0.45 109.4 109.8 243.4 ... 0.783 16.2 30.5 46.6 24.8 8.8 4.6 17.6 25.7 121.8
1 Milwaukee Bucks 82.0 1981 1 60.0 22.0 0.73 108.7 101.8 240.9 ... 0.770 15.4 29.4 44.7 28.3 10.5 6.5 19.3 26.8 113.1
2 San Antonio Spurs 82.0 1981 1 52.0 30.0 0.63 108.5 105.7 241.5 ... 0.769 15.9 31.5 47.4 25.0 8.4 7.8 18.7 25.8 112.3
3 Philadelphia 76ers 82.0 1981 1 62.0 20.0 0.76 107.0 99.5 242.1 ... 0.768 13.3 31.9 45.2 28.9 10.5 7.2 20.8 25.1 111.7
4 Los Angeles Lakers 82.0 1981 1 54.0 28.0 0.66 107.6 103.9 241.5 ... 0.729 14.2 30.4 44.6 28.8 9.9 6.7 19.0 23.8 111.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1167 New York Knicks 82.0 2022 0 37.0 45.0 0.45 110.4 110.5 241.2 ... 0.744 11.5 34.6 46.1 21.9 7.0 4.9 13.3 20.4 106.5
1168 Portland Trail Blazers 82.0 2022 0 27.0 55.0 0.33 107.8 116.9 240.6 ... 0.760 10.4 32.5 42.9 22.9 8.0 4.5 14.5 21.1 106.2
1169 Detroit Pistons 82.0 2022 0 23.0 59.0 0.28 106.0 113.8 241.2 ... 0.782 11.0 32.0 43.0 23.5 7.7 4.8 14.2 21.9 104.8
1170 Orlando Magic 82.0 2022 0 22.0 60.0 0.27 104.5 112.5 241.2 ... 0.787 9.1 35.2 44.3 23.7 6.8 4.5 14.5 19.7 104.2
1171 Oklahoma City Thunder 82.0 2022 0 24.0 58.0 0.29 104.6 112.8 241.5 ... 0.756 10.4 35.2 45.6 22.2 7.6 4.6 14.0 18.3 103.7

1172 rows × 31 columns

Tidying Data

A key aspect of data processing is ensuring that our dataframe is tidy. A tidy dataframe should have column names that are variables, not values. When that is not the case, then our data is called "messy." For example, let's say there is a table about grades that students received on a particular exam. If there is a column for each letter grade and the table's data points are simply the frequency of each letter, then the data is messy. Look at this resource if interested in more information.

Based on the table above, it is clear that each column represents a variable, each observation contains values, and there is no mish-mash of observational units, so our table is already tidy -- there is no need to do table manipulation (melting, etc) to make it so.

Accounting for Missing Values

We also have to account for missing values and the relationship they hold to the observed and unobserved values that we have collected. In order to see whether or not there are missing values in the first place, we will run: df.isna().any(). If the result is false for each and every column, then that means we do not need to estimate the missing values or do other such procedures.

In [9]:
df.isna().any()
Out[9]:
Team              False
G                 False
Year              False
WINNING_STATUS    False
W                 False
L                 False
W%                False
OR                False
DR                False
MP                False
FG                False
FGA               False
FG%               False
3P                False
3PA               False
3P%               False
2P                False
2PA               False
2P%               False
FT                False
FTA               False
FT%               False
ORB               False
DRB               False
TRB               False
AST               False
STL               False
BLK               False
TOV               False
PF                False
PTS               False
dtype: bool

As can be seen from the above output, the value is false for each column in our dataframe, which means that there are no missing values.

Exploratory Analysis and Visualization

The next stage of our data science journey is exploratory analysis and visualization. This stage entails taking a look at various graphs and statistics in order to get a feel for some of the relationships that exist in our data. This should then give us an idea of what factors we should try testing when creating our machine learning model later on in the project. In order to plot, we used the matplotlib.pyplot library, which is used quite commonly. Check out this documentation to see some more about its functionality out.

First, we create a correlation map of the variables of our dataframe in order to see which pairs of them are highly correlated and account for collinearity. It is not ideal to use predictors that are very interrelated with one another because it then becomes more tricky to see the effects of either of them and can skew models that are based upon them.

In [10]:
# correlation map of all the variables
plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.title("Correlation Map of All Variables")
plt.show()

As points per game are the most iconic and focused on metric, we will begin our inquiry with a preliminary investigation into that factor:

Preliminary Investigation

We first create a copy of our dataframe so that we can analyze its features without affecting the data that we have collected.

In [11]:
df_cp1 = df.copy()

After this, we set our dataframe to be indexed by the year. We do this so that when we create our plot, the average points per game will be plotted in relation to the year.

In [12]:
df_cp1.set_index("Year", inplace=True) 

We then group our dataframe by the team and plot the average points per game per year of every team into a line graph.

In [13]:
df_cp1.groupby("Team")["PTS"].plot(legend=False, xlabel= "Year", ylabel ="Average Points Per Game", 
                                                 title = "Distributions of Teams PPG Scored from 1981 to 2022");

This line graph shows that the average points scored per game decreased from 1980-2000, but has recently been on the rise from 2000-2022. However, this does not give much insight into any particular team that has been excelling, which shows that we must observe more data to reach a more concrete hypothesis.

To make the graph more readable, we group the data by the year and take the total mean of all the team's points per game and graph into a single line.

In [14]:
df_cp1.groupby("Year")["PTS"].mean().plot(legend=False, xlabel= "Year", ylabel ="Average Points Per Game", 
                                                 title = "Distributions of Teams PPG Scored from 1981 to 2022");

This graph clearly shows the decrease in the average points per game from 1980-2000, and the rise from 2000-2022.

We then create another copy of our dataframe and plot the average points per game of the champions by extracting teams that were marked as winning. This should give us some idea if the number of points scored per game is one way in which the NBA champion distinguished itself during the regular season and can possibly be a good metric in determing the NBA champ.

In [15]:
df_cp1 = df.copy()
championPointsDF = df_cp1.copy()
pointsAverages = championPointsDF.groupby("Year")["PTS"].mean()

newChampionPointsDF = championPointsDF.loc[championPointsDF["WINNING_STATUS"] == 2] 
newChampionPointsDF = newChampionPointsDF[["Year", "PTS"]]


plt.plot(pointsAverages.index.to_numpy(), pointsAverages.values, label = "league average")
plt.plot(newChampionPointsDF["Year"].to_numpy(), newChampionPointsDF["PTS"].to_numpy(), label = "champion")
plt.title("Average Team Points/Game vs NBA Champion Points/Game over Time")
plt.xlabel("Year")
plt.ylabel("Points per Game")
plt.legend()
plt.show()

This graph shows that the average points per game of the champions followed the same distribution as the rest of the league per year. However, the champions consistently had a higher points per game per year than the rest of the league average, suggesting that that is an important metric to keep track of. We can further see that it is an important distinguisher by performing a simple hypothesis test.

Hypothesis Test Explanation and Quick Application

Hypothesis testing is a procedure performed in statistics to see whether or not a result is significant based on some initial assumption that we have. This initial assumption is known as the null hypothesis. In a hypothesis test, we calculate a certain statistical value (known as the test statistic) and then compare that to the value we expected it to have based on our null hypothesis. Just because it is different is not necessarily indicative that the null hypothesis is false, as naturally, it is possible to have some slight variations in the data that we have. Instead, based on the quantity we calculated and statistical tables, we can calculate a p-value that conveys the probability of obtaining a test statistic that is as contradictory to the null hypothesis value as the one that was obtained. If the p-value is low, then that means that there is enough ground to reject the null hypothesis and accepting the alternative hypothesis (what we test the null hypothesis against); otherwise, we fail to reject the null hypothesis. The value that we use as the threshold for the p-value to determine if it is low or high enough is called the significance level, and it is usually set to 0.05. Read this article to get a more in-depth understanding of hypothesis testing.

As for how we can apply a simple hypothesis test here, since the null hypothesis is what we accept by default and the alternative hypothesis is what we need evidence for, we will make our null hypothesis that the mean of the NBA's points per game is equal to the NBA champions' points per game. The alternative hypothesis that we are testing it against is that the mean of the NBA's points per game is less than the NBA champoins' points per game. Below, we use the scipy.tests library in order to test:

In [17]:
import scipy.stats

league_ppt_avg = pointsAverages.values
champion_ppt = newChampionPointsDF["PTS"].to_numpy()

t_stat, p_value = scipy.stats.ttest_ind(league_ppt_avg, champion_ppt, alternative = "less")

# print results
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")
t-statistic: -2.1784915858683145
p-value: 0.01612030662633339

As we can see, the p-value is less than 0.05, which means we can reject the null hypothesis and conveys that the fact that the NBA champions' points per game being greater (or in other words, the NBA's teams scoring less than the champion during the regular season) is statistically significant. Now that we have done that, let us investigate further into the nature of the points scored by each team in the league.

More Investigation into the nature of Points per Game

Since 1980, there were a few moments in which the NBA added a few more teams to the league. As such, a possible hypothesis for this dip is that a higher number of teams decreases the points per game because it disperses the existing offensive talent over more teams, thus making the average team worse. To observe whether this is true, we will group and count the number of teams, and plot it against the year.

In [18]:
df_cp1.groupby("Year")["Team"].count().plot(xlabel = "Year", ylabel = "Number of Teams",
                                           title = "Number of Teams vs Year");

From this graph, we can easily see the spikes that have occurred in the number of teams in modern NBA history. Although there is some overlap between the increase in teams in 1988 and in the 90s and the decrease in points per game in the "Average Team Points/Game vs NBA Champion Points/Game over Time" graph, there are some notable discrepancies that suggest that the number of teams is not necessarily something that we have to consider when modeling later. For example, in the average points per game graph, the major decrease in scoring starts roughly around 1983 even though the number of teams did not increase in that decade until about 1988. Additionally, the 2000s has featured an explosion in NBA's scoring despite the addition of teams earlier on in the turn of the millienium. We can also see these things in the following graph, where we graphed the number of teams against the average number of points scored.

In [19]:
team_counts = df_cp1.groupby('Year')['Team'].nunique()
df_cp1['NumTeams'] = df_cp1['Year'].map(team_counts)
avg_points_per_team_num = df_cp1.groupby("NumTeams")["PTS"].mean()
df_cp1["NumTeamsPtsAvg"] = df_cp1["NumTeams"].map(avg_points_per_team_num)
plt.plot(df_cp1["NumTeams"].to_numpy(), df_cp1["NumTeamsPtsAvg"].to_numpy())
plt.xlabel("Number of Teams")
plt.ylabel("Average Points per Game")
plt.title("Average Points per Game vs. Number of Teams")
plt.show()

In the graph above, we can see that, even though there is some relationship between the number of teams that have been around and the average points per game, more recent times (past the 2000s especially), when there have been more teams than ever before, has very strongly challenged this notion, so we will not further explore this relationship. Let us explore some other metrics that influence the success of the NBA teams that have become champions.

Exploration into the influence of other metrics

Below, we are making some violin plots in order to observe the ranges in values between the various teams in the league as well as the distribution of values over said range over the last many years in NBA history. Alongside these violin plots, we also have a graph of how each NBA champion performed in that respective category. The purpose behind this side-to-side comparison is to get an idea of whether NBA champions tended to be elite in particular categories over others. If the value of an NBA champ for a particular year corresponds to the upper half of the violin plot for that same very year, then we have a sense that the NBA champion of that year was particularly good in that department. First, we will take a look into the number of field goals made.

In [21]:
fgcompareDF = df.copy()
fgcompareChamp = fgcompareDF.loc[fgcompareDF["WINNING_STATUS"] == 2]
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(20, 6))
sns.violinplot(x="Year", y="FG", data=fgcompareDF, inner="point", ax = ax1)

ax2.plot(fgcompareChamp["Year"].to_numpy(), fgcompareChamp["FG"].to_numpy())
ax2.set_title("Number of Field Goals by Year")
ax2.set_xlabel("Year")
ax2.set_ylabel("Field Goals")
ax1.set_ylim(ax2.get_ylim())
plt.show()

As can be seen in the plots above, the NBA champions often had field goal values that corresponded to the upper half of the violin plots, indicating that the NBA champ of a particular season tended to score a lot of field goals during its regular season games. Now, let us move on to the number of two point shots made.

In [26]:
p2compareDF = df.copy()
p2compareChamp = fgcompareDF.loc[p2compareDF["WINNING_STATUS"] == 2]
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(35, 10))
sns.violinplot(x="Year", y="2P", data=p2compareDF, inner="point", ax = ax1)

ax2.plot(p2compareChamp["Year"].to_numpy(), p2compareChamp["2P"].to_numpy())
ax2.set_title("Two Point Shots Made per Game by Year")
ax2.set_xlabel("Year")
ax2.set_ylabel("Two Point Shots Made")
ax1.set_ylim(ax2.get_ylim())
plt.show()

Although there were some years in which it seems as though the NBA champion was not elite as far as two-point-shot-making goes, there are many years in which the NBA champion was elite, so in addition to points per game and number of field goals, number of two-point-shot-makes is also a vital metric to keep track of. We can also look into wins as well.

In [29]:
wcompareDF = df.copy()
wcompareChamp = fgcompareDF.loc[wcompareDF["WINNING_STATUS"] == 2]
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(35, 10))
sns.violinplot(x="Year", y="W", data=wcompareDF, inner="point", ax = ax1)

ax2.plot(wcompareChamp["Year"].to_numpy(), wcompareChamp["W"].to_numpy())
ax2.set_title("Points per Game by Year")
ax2.set_xlabel("Year")
ax2.set_ylabel("Points per Game")
ax2.set_ylim(ax1.get_ylim())
plt.show()

The number of regular seasons win appears to also be something that NBA champions were elite in. As a matter of fact, out of these violin plots, this is the category in which the NBA champions have most consistently performed well in. This makes sense because most of the NBA's champions had the highest or close to the highest win percentage. Now that we have already seen that points per game, field goals made, two point shots made, and wins/win percentage are key metrics, we can move our exploration into the relationship between regular season winning percentage and these other metrics -- if there is a strong connection between regular season winning percentage and some other metric, then that metric could be something we could consider in our model:

In [30]:
df_cp2 = df.copy()
for stat in df_cp2.loc[:, "OR":"PTS"]:
    df_cp2 = df.copy()
    winper = df_cp2.groupby("Team")["W%"].mean().reset_index()
    statdf = df_cp2.groupby("Team")[stat].mean().reset_index()

    df_cp2 = pd.merge(statdf, winper, how='left', on = 'Team')
    plt.figure(figsize=(6,6))
    sns.regplot(x='W%', y=stat, data=df_cp2)
    for x, y, label in zip(df_cp2["W%"], df_cp2[stat], df_cp2["Team"]):
        plt.text(x, y, label, ha='center', va='center')

    plt.title("Win Percentage vs " + stat)
    plt.xlabel("Win Percentage")
    plt.ylabel(stat)
/tmp/ipykernel_73830/3123756699.py:8: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  plt.figure(figsize=(6,6))

These graphs reveal that some statistics such as FG, FG%, 2P, 2P%, FT, FTA, and OR have a high correlation with the regular season winning percentage. To confirm these metrics, we will observe if the same correlation exists within the season champions of each year

We will then create a similar plot but instead plot both the league and champions' statistic vs win percentage. We use the pandas functon loc to filter out teams that are marked as winning the season, and obtain the mean of their win percentages and statistics into a different dataframe. We then plot both the entire league's win pecentage vs statistic and the champions' on the same plot to observe the difference between the two.

In [31]:
df_cp2 = df.copy()

#iterates through each statistic not including W/L and W%
for stat in df_cp2.loc[:, "OR":"PTS"]:
    df_cp2 = df.copy()
    winners = df_cp2.loc[df_cp2["WINNING_STATUS"]==2] #extracts the winners
    winper = df_cp2.groupby("Team")["W%"].mean().reset_index() #gets the mean of W% by team
    statdf = df_cp2.groupby("Team")[stat].mean().reset_index() #gets the mean of the stat by team

    winperchamp = winners.groupby("Team")["W%"].mean().reset_index() #gets the champions' mean of W% by team
    statchamp = winners.groupby("Team")[stat].mean().reset_index() #gets the champions' mean of the stat by team 

    df_cp2 = pd.merge(statdf, winper, how='left', on = 'Team') #merges the new dataframes created
    plt.figure(figsize=(6,6)) #sets the graph size
    sns.regplot(x='W%', y=stat, data=df_cp2, label = "League") #plots the league's win percentage vs the statistic 
    sns.regplot(x='W%', y=stat, data=winners, label = "Champions") #plots the champions' win percentage vs the statistic 

    #Sets plot title, legend, and x/y labels
    plt.title("Win Percentage vs " + stat + " Between League and Champions")
    plt.legend()
    plt.xlabel("Win Percentage")
    plt.ylabel(stat)
/tmp/ipykernel_73830/450008879.py:14: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  plt.figure(figsize=(6,6)) #sets the graph size

As seen in the graphs above, the champions have a higher average win percentage ranging from 0.6-0.9. Some of the statistics are also differently correlated between the league and the champions. For example, field goal attempts do not have much of a correlation in the league, but has a significant linear regression line amongst the champions. Other stats such as 2P also have a higher correlation amongst champions rather than the rest of the league. On the other hand, some stats have the oppoite effect, with defensive rebounds and 3P having a negative correlation among champions despite having a positive correlation amongst the rest of the league.

This shows us that some statistics are more correlated to the win percentage of the league champions than others, which influences which data we should use to input in our ML models. Statistics such as OR and TRB are consistently correlated with both the league and champions, but others differ between the two, and can lead to different prediction success/failure when inputted into our ML models.

Machine Learning Modeling

This is the fourth step of the data science lifecycle where we try to model our data to make a future prediction.

First we want to get the regular season data for the 2022-2023 regular season. This will allow us to develop a model to run on this data and determine which team will win the 2023 NBA Championship based on the regular season statistics and past trends.

We will begin by web scraping Basketball Reference for the 2023 per game and advanced statistics.

In [32]:
#Set the columns of the pandas dataframe
df_per_game_2023 = pd.DataFrame(columns=["Team", "G", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "FT",
                          "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS", "Year", "WINNING_STATUS"])

df_advanced_2023 = pd.DataFrame(columns=["Team","Year","W", "L", "OR", "DR"])
In [33]:
#Gets URL for every requested year and retreives HTML
year = 2023
yearlyURL = 'https://www.basketball-reference.com/leagues/NBA_2023.html'
x = requests.get(yearlyURL, headers=headers)
soup = BeautifulSoup(x.text)
table = soup.find('table', id ='per_game-team')
table_advanced = soup.find('table', id ='advanced-team')

#scrape advanced stats for wins and losses
for row in table_advanced.tbody.findAll('tr'):
    columns = row.findAll('td')
    teamName = columns[0].text.strip()
    if (columns[0].text.strip()[-1] == "*"):
        teamName = teamName[:-1]

    columns = row.findAll('td')
    new_row = pd.Series({"Team":teamName, "Year": year, "W": float(columns[2].text.strip()), "L": float(columns[3].text.strip()), 
                            "OR": float(columns[9].text.strip()),"DR": float(columns[10].text.strip())})
    df_advanced_2023 = df_advanced_2023.append(new_row, ignore_index=True)

#scrap per game statistics for remaining statistics
for row in table.tbody.findAll('tr'):
    columns = row.findAll('td')
    
    #process whether the team made the playoffs/won the finals
    winningStatus = 0
    teamName = columns[0].text.strip()
    if (columns[0].text.strip()[-1] == "*"): 
        winningStatus = 1
        teamName = teamName[:-1]
    
    #develop new row for each team in the current season and append to Pandas df
    new_row = pd.Series({"Team": teamName, "G": float(columns[1].text.strip()), "MP": float(columns[2].text.strip()), "FG": float(columns[3].text.strip()), 
                "FGA": float(columns[4].text.strip()), "FG%": float(columns[5].text.strip()), "3P": float(columns[6].text.strip()), "3PA": float(columns[7].text.strip()),
                "3P%": float(columns[8].text.strip()), "2P": float(columns[9].text.strip()), "2PA": float(columns[10].text.strip()), "2P%": float(columns[11].text.strip()),
                "FT": float(columns[12].text.strip()), "FTA": float(columns[13].text.strip()), "FT%": float(columns[14].text.strip()), "ORB": float(columns[15].text.strip()),
                "DRB": float(columns[16].text.strip()), "TRB": float(columns[17].text.strip()), "AST": float(columns[18].text.strip()), "STL": float(columns[19].text.strip()),
                "BLK": float(columns[20].text.strip()), "TOV": float(columns[21].text.strip()), "PF": float(columns[22].text.strip()), "PTS": float(columns[23].text.strip()),
                "Year": year, "WINNING_STATUS": winningStatus})
    df_per_game_2023 = df_per_game_2023.append(new_row, ignore_index=True)

df_per_game_2023.loc[(df_per_game_2023["Team"] == soup.findAll("p")[2].find("a").text.strip()) & (df_per_game_2023["Year"] == year), "WINNING_STATUS"] = 2

Then we can merge the 2023 per game and advanced statistics into one Pandas df which will contain all of the features we have been using thus far.

In [34]:
df_2023 = pd.merge(df_per_game_2023, df_advanced_2023, on=['Team', 'Year'])
df_2023.Year = df_2023.Year.apply(pd.to_numeric)
df_2023.W = df_2023.W.apply(pd.to_numeric)
df_2023.L = df_2023.L.apply(pd.to_numeric)
df_2023['W%'] = round(df_2023['W']/df_2023['G'], 2)
df_2023
Out[34]:
Team G MP FG FGA FG% 3P 3PA 3P% 2P ... TOV PF PTS Year WINNING_STATUS W L OR DR W%
0 Sacramento Kings 82.0 241.8 43.6 88.2 0.494 13.8 37.3 0.369 29.8 ... 13.5 19.7 120.7 2023 1 48.0 34.0 119.4 116.8 0.59
1 Golden State Warriors 82.0 241.8 43.1 90.2 0.479 16.6 43.2 0.385 26.5 ... 16.3 21.4 118.9 2023 1 44.0 38.0 116.1 114.4 0.54
2 Atlanta Hawks 82.0 242.1 44.6 92.4 0.483 10.8 30.5 0.352 33.9 ... 12.9 18.8 118.4 2023 1 41.0 41.0 116.6 116.3 0.50
3 Boston Celtics 82.0 243.7 42.2 88.8 0.475 16.0 42.6 0.377 26.2 ... 13.4 18.8 117.9 2023 1 57.0 25.0 118.0 111.5 0.70
4 Oklahoma City Thunder 82.0 242.1 43.1 92.6 0.465 12.1 34.1 0.356 31.0 ... 13.0 21.0 117.5 2023 1 40.0 42.0 115.2 114.2 0.49
5 Los Angeles Lakers 82.0 242.4 42.9 89.0 0.482 10.8 31.2 0.346 32.1 ... 14.1 17.9 117.2 2023 1 43.0 39.0 114.5 113.9 0.52
6 Utah Jazz 82.0 241.5 42.5 89.8 0.473 13.3 37.8 0.353 29.2 ... 15.4 20.5 117.1 2023 0 37.0 45.0 115.8 116.7 0.45
7 Memphis Grizzlies 82.0 241.2 43.7 92.1 0.475 12.0 34.2 0.351 31.7 ... 13.6 20.0 116.9 2023 1 51.0 31.0 115.1 111.2 0.62
8 Milwaukee Bucks 82.0 241.8 42.7 90.4 0.473 14.8 40.3 0.368 27.9 ... 14.6 18.1 116.9 2023 1 58.0 24.0 115.4 111.9 0.71
9 Indiana Pacers 82.0 240.9 42.0 89.6 0.469 13.6 37.0 0.367 28.4 ... 14.9 21.2 116.3 2023 0 35.0 47.0 114.6 117.7 0.43
10 New York Knicks 82.0 243.4 42.0 89.4 0.470 12.6 35.7 0.354 29.4 ... 13.0 20.3 116.0 2023 1 47.0 35.0 117.8 114.8 0.57
11 Denver Nuggets 82.0 240.9 43.6 86.4 0.504 11.8 31.2 0.379 31.8 ... 14.5 18.6 115.8 2023 1 53.0 29.0 117.6 114.2 0.65
12 Minnesota Timberwolves 82.0 241.8 42.9 87.4 0.490 12.2 33.3 0.365 30.7 ... 15.3 21.6 115.8 2023 1 42.0 40.0 113.7 113.8 0.51
13 Philadelphia 76ers 82.0 242.4 40.8 83.8 0.487 12.6 32.6 0.387 28.2 ... 13.7 20.4 115.2 2023 1 54.0 28.0 117.7 113.3 0.66
14 New Orleans Pelicans 82.0 242.1 42.0 87.6 0.480 11.0 30.1 0.364 31.1 ... 14.6 20.5 114.4 2023 1 42.0 40.0 114.4 112.5 0.51
15 Dallas Mavericks 82.0 243.0 40.0 84.3 0.475 15.2 41.0 0.371 24.8 ... 12.2 20.7 114.2 2023 0 38.0 44.0 116.8 116.7 0.46
16 Phoenix Suns 82.0 241.2 42.1 90.1 0.467 12.2 32.6 0.374 29.9 ... 13.5 21.2 113.6 2023 1 45.0 37.0 115.1 113.0 0.55
17 Los Angeles Clippers 82.0 241.8 41.1 86.1 0.477 12.7 33.4 0.381 28.4 ... 14.2 19.5 113.6 2023 1 44.0 38.0 115.0 114.5 0.54
18 Portland Trail Blazers 82.0 240.6 40.5 85.4 0.474 12.9 35.3 0.365 27.6 ... 14.5 20.0 113.4 2023 0 33.0 49.0 114.8 118.8 0.40
19 Brooklyn Nets 82.0 240.6 41.5 85.1 0.487 12.8 33.8 0.378 28.7 ... 13.7 21.1 113.4 2023 1 45.0 37.0 115.0 114.1 0.55
20 Washington Wizards 82.0 240.9 42.1 86.9 0.485 11.3 31.7 0.356 30.9 ... 14.1 18.8 113.2 2023 0 35.0 47.0 114.4 115.6 0.43
21 Chicago Bulls 82.0 242.7 42.5 86.8 0.490 10.4 28.9 0.361 32.1 ... 13.4 18.9 113.1 2023 1 40.0 42.0 113.5 112.2 0.49
22 San Antonio Spurs 82.0 242.1 43.1 92.6 0.465 11.1 32.2 0.345 32.0 ... 15.3 19.9 113.0 2023 0 22.0 60.0 110.2 120.0 0.27
23 Toronto Raptors 82.0 241.5 41.9 91.3 0.459 10.7 32.0 0.335 31.1 ... 11.7 20.0 112.9 2023 1 41.0 41.0 115.5 114.0 0.50
24 Cleveland Cavaliers 82.0 242.4 41.6 85.2 0.488 11.6 31.6 0.367 30.0 ... 13.3 19.0 112.3 2023 1 51.0 31.0 116.1 110.6 0.62
25 Orlando Magic 82.0 241.2 40.5 86.3 0.470 10.8 31.1 0.346 29.8 ... 15.1 20.1 111.4 2023 0 34.0 48.0 111.6 114.2 0.41
26 Charlotte Hornets 82.0 241.8 41.3 90.4 0.457 10.7 32.5 0.330 30.5 ... 14.2 20.3 111.0 2023 0 27.0 55.0 109.2 115.3 0.33
27 Houston Rockets 82.0 240.9 40.6 88.9 0.457 10.4 31.9 0.327 30.2 ... 16.2 20.5 110.7 2023 0 22.0 60.0 111.4 119.3 0.27
28 Detroit Pistons 82.0 241.5 39.6 87.1 0.454 11.4 32.4 0.351 28.2 ... 15.1 22.1 110.3 2023 0 17.0 65.0 110.7 118.9 0.21
29 Miami Heat 82.0 241.5 39.2 85.3 0.460 12.0 34.8 0.344 27.3 ... 13.5 18.5 109.5 2023 1 44.0 38.0 113.0 113.3 0.54

30 rows × 31 columns

Some of these libraries were already loaded at the start of the tutorial, but we will reload them here to emphasize which libraries are meant specifically for our Machine Learning Models.

In [35]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

First Model: Decision Tree

Decision Trees are a machine learning model that learns simple decision rules from our data features in order to create a model that can predict a target variable. We will start with this model because it is one of the more simpler models for a new data scientist to understand.

The first thing we need to do is define our feature columns which are the statistics in the table that we want the model to use. We can start by using all of the features, but we will narrow them down based on which features are deemed redundant by the correlation matrix from earlier as well as trial error to see which features maximize the model accuracy.

Then, we will split our collected data into training and testing subsets. The goal here is to develop the model by having it learn from the training data and test the model by running it on the testing data and measuring its accuracy.

Finally, we can use the DecisionTreeClassifier() function from sklearn to create model using the training and testing subsets.

In [52]:
df_dec = df.copy()

#narrow down features to use for model
feature_cols = ["FG", "FGA", "3P", "3P%", "2P", "FT", "FTA", "FT%", "AST", "STL", "BLK", "PF", "PTS", "Year", "W", "L"]

#create subset of features from original dataframe and define target variable as WINNING STATUS
X = df_dec[feature_cols]
y = df.WINNING_STATUS

#split data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

#fit the model
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

#use predict() to determine accuracy
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.8238636363636364

For a Decision Tree model the accuracy can vary every time you run the cell, but the average accuracy measurement hovers around the 0.84 mark. This is a strong accuracy number, so we can go ahead and try to predict the winner of the 2023 NBA Finals using this model.

In [53]:
#predict 2023 winner using 2023 dataframe
X_2023 = df_2023[feature_cols]
y_pred_2023 = clf.predict(X_2023)

# sort predictions in descending order and select the team with the highest prediction value
df_2023['WINNING_STATUS_clfpred'] = y_pred_2023
df_2023 = df_2023.sort_values(by=['WINNING_STATUS_clfpred'], ascending=False)

#pick the best and second best teams and print them
best_team = df_2023.iloc[0]['Team']
secondbest_team = df_2023.iloc[1]['Team']
print('The team with the HIGHEST predicted probability of winning the championship in 2023 is: ' + str(best_team))
print('The team with the second highest predicted probability is: ' + str(secondbest_team))
The team with the HIGHEST predicted probability of winning the championship in 2023 is: Boston Celtics
The team with the second highest predicted probability is: Milwaukee Bucks

As we can see, the team that this model picks to win the Finals is the Boston Celtics and the team with the second highest probability is the Milwaukee Bucks.

Now that we have the prediction we could stop here, but instead we will try to use more models in an attempt to increase the accuracy from 0.84.

Second Model: Support Vector Machines (SVM)

SVMs are a set of machine learning methods used in classification which is helpful for our purposes. We want to classify each team as either winning the finals with a WINNING_STATUS = 2, or making the playoffs with a value of 1, or missing playoffs with a value of 0.

Similar to the previous model, we start by determining which features combine to create the highest accuracy value. Then we split the data into training and testing subsets and fit the model using the subsets.

In [54]:
from sklearn.svm import SVC

df_dec = df.copy()

#narrow down features
feature_cols = ["FG", "FG%", "3P", "3P%", "2P", "2P%", "FT",      "FTA", "FT%", "STL", "BLK", "PF", "PTS",    "W", "L"]
X = df_dec[feature_cols]
y = df.WINNING_STATUS

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

# SVM
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
sscore = svm.score(X_test, y_test)
print("SVM Accuracy: ", sscore)
SVM Accuracy:  0.8693181818181818

We have improved our accuracy from ~0.84 to ~0.87 which is a significant improvement. Now let's try using this model to predict the 2023 NBA Finals winner.

In [55]:
X_2023 = df_2023[feature_cols]
y_pred_2023 = svm.predict(X_2023)
df_2023['WINNING_STATUS_svmpred'] = y_pred_2023

# Sort the dataframe by WINNING_STATUS_pred in descending order and select the first row
best_team = df_2023.sort_values(by='WINNING_STATUS_svmpred', ascending=False).iloc[0]['Team']
secondbest_team = df_2023.sort_values(by='WINNING_STATUS_svmpred', ascending=False).iloc[1]['Team']
print('The team with the HIGHEST predicted probability of winning the championship in 2023 is: ' + str(best_team))
print('The team with the second highest predicted probability is: ' + str(secondbest_team))
The team with the HIGHEST predicted probability of winning the championship in 2023 is: Boston Celtics
The team with the second highest predicted probability is: Philadelphia 76ers

The SVM model predicts the Boston Celtics to win the NBA Finals with the Philadelphia 76ers as having the second highest probability.

We will now try to use the K Nearest Neighbors (KNN) model and see if we can improve our accuracy any more.

Third Model: K Nearest Neighbors (KNN)

KNN is a machine learning model that stores instances of training data. It may be easier to think of instances as points on a graph and classification is done by doing a majority vote of the nearest neighbors of any query point on the graph to determine what the classification of the query point should be. The accuracy of the model can change based on the number of nearest neighbors being analyzed, so we will change this number to see what the highest possible accuracy will be for a specific value of k.

Just like the previous models, we will first narrow down the features to those that are relevant, and then split the data into training and testing subsets to be used when fitting the model.

In [56]:
from sklearn.neighbors import KNeighborsClassifier

df_dec = df.copy()

#narrow features
feature_cols = ["FG", "FGA", "FG%", "3P", "3P%", "2P", "2P%", "FT",      "FTA", "FT%", "AST", "STL", "BLK", "TOV", "PF", "PTS", "Year", "W", "L", "OR", "DR"]
X = df_dec[feature_cols]
y = df.WINNING_STATUS

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

#fit KNN model
kfin = 0
kscore = 0

#determine accuracy level for varying numbers of nearest neighbors
for k in range(1,21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    if score > kscore:
        kfin = k
        kscore = score
    print("K Nearest Neighbors Accuracy for k = ", k, ": ", score)
K Nearest Neighbors Accuracy for k =  1 :  0.8125
K Nearest Neighbors Accuracy for k =  2 :  0.8295454545454546
K Nearest Neighbors Accuracy for k =  3 :  0.8323863636363636
K Nearest Neighbors Accuracy for k =  4 :  0.8494318181818182
K Nearest Neighbors Accuracy for k =  5 :  0.8636363636363636
K Nearest Neighbors Accuracy for k =  6 :  0.8607954545454546
K Nearest Neighbors Accuracy for k =  7 :  0.875
K Nearest Neighbors Accuracy for k =  8 :  0.8664772727272727
K Nearest Neighbors Accuracy for k =  9 :  0.8693181818181818
K Nearest Neighbors Accuracy for k =  10 :  0.8607954545454546
K Nearest Neighbors Accuracy for k =  11 :  0.8607954545454546
K Nearest Neighbors Accuracy for k =  12 :  0.8721590909090909
K Nearest Neighbors Accuracy for k =  13 :  0.8693181818181818
K Nearest Neighbors Accuracy for k =  14 :  0.8721590909090909
K Nearest Neighbors Accuracy for k =  15 :  0.8636363636363636
K Nearest Neighbors Accuracy for k =  16 :  0.8693181818181818
K Nearest Neighbors Accuracy for k =  17 :  0.8636363636363636
K Nearest Neighbors Accuracy for k =  18 :  0.875
K Nearest Neighbors Accuracy for k =  19 :  0.8721590909090909
K Nearest Neighbors Accuracy for k =  20 :  0.8721590909090909

The output above shows that the highest accuracy for this model is achieved with a k value of 7 and 18, both of which have an accuracy of 0.875. This is an improvement over the previous two models, so we can go ahead and try to use this model to predict the 2023 NBA Finals winner.

In [57]:
X_2023 = df_2023[feature_cols]
y_pred_proba_2023 = knn.predict_proba(X_2023)

# Get indices of teams in descending order of predicted probability for class 2 (Championship)
team_indices = np.argsort(y_pred_proba_2023[:, 2])[::-1]

# Select top two teams
top_teams = df_2023.iloc[team_indices[:2]]['Team'].tolist()

# Set winning status map
winning_status_map = {0: 'Lose', 1: 'Win', 2: 'Championship'}


# Print the two teams with the highest probability to win the championship
print('The team with the HIGHEST predicted probability of winning the championship in 2023 is: ' + str(top_teams[0]))
print('The team with the second highest predicted probability is: ' + str(top_teams[1]))
The team with the HIGHEST predicted probability of winning the championship in 2023 is: Boston Celtics
The team with the second highest predicted probability is: Milwaukee Bucks

The KNN model has predicted the Boston Celtics to win the 2023 NBA Finals with the Milwaukee Bucks having the second highest probability of winning.

The last sklearn model we will use is logistic regression to see if we can increase the accuracy any more.

Fourth Model: Logistic Regression

Logistic Regression is a machine learning model that is used as a classification model. We will try to use it to classify whether an NBA team will win the finals by developing a logistic regression model. This model can do binary or one vs rest predictions and we will do one vs rest because we want to predict whether a team will have a WINNING_STATUS of 2 or 0/1.

We start by narrowing down the features to those that maximize the accuracy and then split data to fit the model.

In [58]:
df_dec = df.copy()

feature_cols = ["FG", "3P", "2P", "FT%", "AST", "STL", "W", "L"]

X = df_dec[feature_cols]
y = df.WINNING_STATUS

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=5000)
log_reg.fit(X_train, y_train)
lscore = log_reg.score(X_test, y_test)
print("Logistic Regression Accuracy: ", lscore)
Logistic Regression Accuracy:  0.9034090909090909

The measured accuracy is 0.90 which is the highest one yet! Let's try to use this model to predict the 2023 NBA Finals winner.

In [59]:
X_2023 = df_2023[feature_cols]
y_pred_2023 = log_reg.predict(X_2023)
df_2023['WINNING_STATUS_logpred'] = y_pred_2023

# Sort DataFrame by 'WINNING_STATUS_pred' column in descending order
df_2023 = df_2023.sort_values('WINNING_STATUS_logpred', ascending=False)

# Select the first row (highest 'WINNING_STATUS_pred')
best_team = df_2023.iloc[0]['Team']
secondbest_team = df_2023.iloc[1]['Team']
print('The team with the HIGHEST predicted probability of winning the championship in 2023 is: ' + str(best_team))
print('The team with the second highest predicted probability is: ' + str(secondbest_team))
The team with the HIGHEST predicted probability of winning the championship in 2023 is: Boston Celtics
The team with the second highest predicted probability is: Philadelphia 76ers

The Logistic Regression model picks the Boston Celtics to win the NBA Finals and deems the Philadelphia 76ers as the second highest probability to win.

We have now gone over 4 different sklearn models, so now we will try to create our own model which is a neural network to see if we can predict the NBA Finals winner.

Fifth Model: CUSTOM Neural Network!

The below code is a custom neural network used that takes in input for the given columns, and outputs a probability of a team to get a winning status of 0,1, and 2. We start off by standardizing the input, which helps increase accuracy. From there, we split the data up into training, testing, and validating data. The training is used to train, testing used to test, and validating to validate during the training process. Our layer structure is arbitrary, and can be considered a hyperparameter. We messed with a couple of options, and this produced the highest accuracy. We then compile the model with the binary cross entropy as the loss function, and the adam optimizer. After training the model, we pass in the 2023 data to get an output and prediction on who will win the championship this year.

In [62]:
#imports
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

# copy dataframe
df_dec = df.copy()

# categories for neural network
x = df_dec[["FG", "FGA", "FG%", "3P", "3P%", "2P", "2P%", "FT",
      "FTA", "FT%", "AST", "STL", "BLK", "TOV", "PF", "PTS", "Year",
     "W", "L", "OR", "DR"]]
y = df_dec.WINNING_STATUS

# One-hot encode the labels
y = to_categorical(y, num_classes=3)

# Normalize the input data to ensure better performance
scaler = StandardScaler()
x = scaler.fit_transform(x)

#
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

# Define the model architecture uses softmax to return probability for the winning status to be 0,1, and 2. They all should add up to 1.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model for 200 epochs
model.fit(x_train, y_train, epochs=200, batch_size=32, validation_data=(x_val, y_val))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)
Epoch 1/200
24/24 [==============================] - 2s 16ms/step - loss: 0.6803 - accuracy: 0.4820 - val_loss: 0.6229 - val_accuracy: 0.7074
Epoch 2/200
24/24 [==============================] - 0s 6ms/step - loss: 0.5832 - accuracy: 0.6555 - val_loss: 0.5056 - val_accuracy: 0.7872
Epoch 3/200
24/24 [==============================] - 0s 5ms/step - loss: 0.4718 - accuracy: 0.7797 - val_loss: 0.4089 - val_accuracy: 0.7979
Epoch 4/200
24/24 [==============================] - 0s 5ms/step - loss: 0.3747 - accuracy: 0.8117 - val_loss: 0.3422 - val_accuracy: 0.8298
Epoch 5/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2947 - accuracy: 0.8678 - val_loss: 0.3091 - val_accuracy: 0.8404
Epoch 6/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2470 - accuracy: 0.8812 - val_loss: 0.3012 - val_accuracy: 0.8404
Epoch 7/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2481 - accuracy: 0.8758 - val_loss: 0.2881 - val_accuracy: 0.8670
Epoch 8/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2285 - accuracy: 0.9025 - val_loss: 0.2798 - val_accuracy: 0.8617
Epoch 9/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2171 - accuracy: 0.9039 - val_loss: 0.2732 - val_accuracy: 0.8564
Epoch 10/200
24/24 [==============================] - 0s 5ms/step - loss: 0.2178 - accuracy: 0.8879 - val_loss: 0.2721 - val_accuracy: 0.8511
Epoch 11/200
24/24 [==============================] - 0s 6ms/step - loss: 0.2084 - accuracy: 0.9012 - val_loss: 0.2747 - val_accuracy: 0.8617
Epoch 12/200
24/24 [==============================] - 0s 9ms/step - loss: 0.2118 - accuracy: 0.9012 - val_loss: 0.2686 - val_accuracy: 0.8617
Epoch 13/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1969 - accuracy: 0.9052 - val_loss: 0.2646 - val_accuracy: 0.8617
Epoch 14/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1967 - accuracy: 0.8959 - val_loss: 0.2589 - val_accuracy: 0.8617
Epoch 15/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1888 - accuracy: 0.8985 - val_loss: 0.2738 - val_accuracy: 0.8617
Epoch 16/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1782 - accuracy: 0.9172 - val_loss: 0.2631 - val_accuracy: 0.8564
Epoch 17/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1700 - accuracy: 0.9146 - val_loss: 0.2690 - val_accuracy: 0.8564
Epoch 18/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1665 - accuracy: 0.9132 - val_loss: 0.2761 - val_accuracy: 0.8511
Epoch 19/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1743 - accuracy: 0.9065 - val_loss: 0.2700 - val_accuracy: 0.8564
Epoch 20/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1838 - accuracy: 0.9065 - val_loss: 0.2783 - val_accuracy: 0.8511
Epoch 21/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1619 - accuracy: 0.9146 - val_loss: 0.2733 - val_accuracy: 0.8564
Epoch 22/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1619 - accuracy: 0.9052 - val_loss: 0.2638 - val_accuracy: 0.8511
Epoch 23/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1706 - accuracy: 0.9105 - val_loss: 0.2668 - val_accuracy: 0.8511
Epoch 24/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1503 - accuracy: 0.9172 - val_loss: 0.2789 - val_accuracy: 0.8511
Epoch 25/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1550 - accuracy: 0.9172 - val_loss: 0.2772 - val_accuracy: 0.8511
Epoch 26/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1434 - accuracy: 0.9186 - val_loss: 0.2697 - val_accuracy: 0.8511
Epoch 27/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1562 - accuracy: 0.9159 - val_loss: 0.2664 - val_accuracy: 0.8511
Epoch 28/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1507 - accuracy: 0.9199 - val_loss: 0.2774 - val_accuracy: 0.8511
Epoch 29/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1443 - accuracy: 0.9105 - val_loss: 0.2665 - val_accuracy: 0.8564
Epoch 30/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1509 - accuracy: 0.9239 - val_loss: 0.2799 - val_accuracy: 0.8564
Epoch 31/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1386 - accuracy: 0.9239 - val_loss: 0.2723 - val_accuracy: 0.8404
Epoch 32/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1558 - accuracy: 0.9132 - val_loss: 0.2782 - val_accuracy: 0.8511
Epoch 33/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1481 - accuracy: 0.9212 - val_loss: 0.2733 - val_accuracy: 0.8457
Epoch 34/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1277 - accuracy: 0.9359 - val_loss: 0.2731 - val_accuracy: 0.8564
Epoch 35/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1309 - accuracy: 0.9199 - val_loss: 0.2780 - val_accuracy: 0.8511
Epoch 36/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1322 - accuracy: 0.9292 - val_loss: 0.2693 - val_accuracy: 0.8564
Epoch 37/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1349 - accuracy: 0.9212 - val_loss: 0.2759 - val_accuracy: 0.8564
Epoch 38/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1368 - accuracy: 0.9212 - val_loss: 0.2760 - val_accuracy: 0.8564
Epoch 39/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1300 - accuracy: 0.9172 - val_loss: 0.2795 - val_accuracy: 0.8511
Epoch 40/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1298 - accuracy: 0.9199 - val_loss: 0.2880 - val_accuracy: 0.8511
Epoch 41/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1288 - accuracy: 0.9266 - val_loss: 0.2915 - val_accuracy: 0.8404
Epoch 42/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1268 - accuracy: 0.9252 - val_loss: 0.2749 - val_accuracy: 0.8564
Epoch 43/200
24/24 [==============================] - 0s 8ms/step - loss: 0.1272 - accuracy: 0.9266 - val_loss: 0.2769 - val_accuracy: 0.8617
Epoch 44/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1277 - accuracy: 0.9306 - val_loss: 0.2861 - val_accuracy: 0.8564
Epoch 45/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1229 - accuracy: 0.9319 - val_loss: 0.2916 - val_accuracy: 0.8511
Epoch 46/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1281 - accuracy: 0.9292 - val_loss: 0.2723 - val_accuracy: 0.8617
Epoch 47/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1150 - accuracy: 0.9399 - val_loss: 0.2867 - val_accuracy: 0.8511
Epoch 48/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1271 - accuracy: 0.9266 - val_loss: 0.2815 - val_accuracy: 0.8457
Epoch 49/200
24/24 [==============================] - 0s 8ms/step - loss: 0.1366 - accuracy: 0.9239 - val_loss: 0.2796 - val_accuracy: 0.8511
Epoch 50/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1312 - accuracy: 0.9252 - val_loss: 0.2853 - val_accuracy: 0.8511
Epoch 51/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1160 - accuracy: 0.9426 - val_loss: 0.2891 - val_accuracy: 0.8564
Epoch 52/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1287 - accuracy: 0.9332 - val_loss: 0.2875 - val_accuracy: 0.8564
Epoch 53/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1262 - accuracy: 0.9226 - val_loss: 0.2865 - val_accuracy: 0.8564
Epoch 54/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1112 - accuracy: 0.9399 - val_loss: 0.2829 - val_accuracy: 0.8617
Epoch 55/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1181 - accuracy: 0.9386 - val_loss: 0.2807 - val_accuracy: 0.8670
Epoch 56/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1119 - accuracy: 0.9399 - val_loss: 0.2818 - val_accuracy: 0.8723
Epoch 57/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1269 - accuracy: 0.9359 - val_loss: 0.2753 - val_accuracy: 0.8617
Epoch 58/200
24/24 [==============================] - 0s 9ms/step - loss: 0.1166 - accuracy: 0.9346 - val_loss: 0.2830 - val_accuracy: 0.8670
Epoch 59/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1111 - accuracy: 0.9332 - val_loss: 0.2854 - val_accuracy: 0.8617
Epoch 60/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1061 - accuracy: 0.9372 - val_loss: 0.2831 - val_accuracy: 0.8564
Epoch 61/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1014 - accuracy: 0.9426 - val_loss: 0.3023 - val_accuracy: 0.8564
Epoch 62/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1045 - accuracy: 0.9399 - val_loss: 0.2950 - val_accuracy: 0.8511
Epoch 63/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1135 - accuracy: 0.9372 - val_loss: 0.2931 - val_accuracy: 0.8564
Epoch 64/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1008 - accuracy: 0.9413 - val_loss: 0.2900 - val_accuracy: 0.8670
Epoch 65/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0949 - accuracy: 0.9399 - val_loss: 0.2986 - val_accuracy: 0.8670
Epoch 66/200
24/24 [==============================] - 0s 5ms/step - loss: 0.1024 - accuracy: 0.9386 - val_loss: 0.2919 - val_accuracy: 0.8670
Epoch 67/200
24/24 [==============================] - 0s 8ms/step - loss: 0.1138 - accuracy: 0.9426 - val_loss: 0.2818 - val_accuracy: 0.8723
Epoch 68/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1064 - accuracy: 0.9413 - val_loss: 0.2830 - val_accuracy: 0.8723
Epoch 69/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0946 - accuracy: 0.9426 - val_loss: 0.3014 - val_accuracy: 0.8723
Epoch 70/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0979 - accuracy: 0.9479 - val_loss: 0.2977 - val_accuracy: 0.8670
Epoch 71/200
24/24 [==============================] - 0s 6ms/step - loss: 0.1022 - accuracy: 0.9359 - val_loss: 0.2915 - val_accuracy: 0.8723
Epoch 72/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0911 - accuracy: 0.9479 - val_loss: 0.3033 - val_accuracy: 0.8670
Epoch 73/200
24/24 [==============================] - 0s 12ms/step - loss: 0.1051 - accuracy: 0.9439 - val_loss: 0.3010 - val_accuracy: 0.8617
Epoch 74/200
24/24 [==============================] - 0s 10ms/step - loss: 0.0934 - accuracy: 0.9453 - val_loss: 0.3067 - val_accuracy: 0.8617
Epoch 75/200
24/24 [==============================] - 0s 10ms/step - loss: 0.0930 - accuracy: 0.9426 - val_loss: 0.3331 - val_accuracy: 0.8564
Epoch 76/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0978 - accuracy: 0.9439 - val_loss: 0.3258 - val_accuracy: 0.8670
Epoch 77/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0823 - accuracy: 0.9573 - val_loss: 0.3191 - val_accuracy: 0.8670
Epoch 78/200
24/24 [==============================] - 0s 10ms/step - loss: 0.1014 - accuracy: 0.9439 - val_loss: 0.3201 - val_accuracy: 0.8511
Epoch 79/200
24/24 [==============================] - 0s 7ms/step - loss: 0.0824 - accuracy: 0.9466 - val_loss: 0.3100 - val_accuracy: 0.8617
Epoch 80/200
24/24 [==============================] - 0s 7ms/step - loss: 0.0856 - accuracy: 0.9519 - val_loss: 0.3277 - val_accuracy: 0.8564
Epoch 81/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0942 - accuracy: 0.9519 - val_loss: 0.3228 - val_accuracy: 0.8511
Epoch 82/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0996 - accuracy: 0.9399 - val_loss: 0.3204 - val_accuracy: 0.8617
Epoch 83/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0937 - accuracy: 0.9426 - val_loss: 0.3163 - val_accuracy: 0.8564
Epoch 84/200
24/24 [==============================] - 0s 9ms/step - loss: 0.1007 - accuracy: 0.9399 - val_loss: 0.3041 - val_accuracy: 0.8564
Epoch 85/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0952 - accuracy: 0.9519 - val_loss: 0.3061 - val_accuracy: 0.8617
Epoch 86/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0949 - accuracy: 0.9546 - val_loss: 0.2967 - val_accuracy: 0.8617
Epoch 87/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0874 - accuracy: 0.9479 - val_loss: 0.3041 - val_accuracy: 0.8723
Epoch 88/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0881 - accuracy: 0.9519 - val_loss: 0.3206 - val_accuracy: 0.8617
Epoch 89/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0860 - accuracy: 0.9493 - val_loss: 0.3314 - val_accuracy: 0.8617
Epoch 90/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0820 - accuracy: 0.9493 - val_loss: 0.3280 - val_accuracy: 0.8511
Epoch 91/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0806 - accuracy: 0.9533 - val_loss: 0.3675 - val_accuracy: 0.8564
Epoch 92/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0799 - accuracy: 0.9599 - val_loss: 0.3543 - val_accuracy: 0.8404
Epoch 93/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0916 - accuracy: 0.9479 - val_loss: 0.3237 - val_accuracy: 0.8351
Epoch 94/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0840 - accuracy: 0.9573 - val_loss: 0.3529 - val_accuracy: 0.8457
Epoch 95/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0820 - accuracy: 0.9519 - val_loss: 0.3892 - val_accuracy: 0.8404
Epoch 96/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0892 - accuracy: 0.9479 - val_loss: 0.3505 - val_accuracy: 0.8564
Epoch 97/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0716 - accuracy: 0.9559 - val_loss: 0.3313 - val_accuracy: 0.8617
Epoch 98/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0743 - accuracy: 0.9533 - val_loss: 0.3508 - val_accuracy: 0.8457
Epoch 99/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0810 - accuracy: 0.9519 - val_loss: 0.3692 - val_accuracy: 0.8404
Epoch 100/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0778 - accuracy: 0.9519 - val_loss: 0.3835 - val_accuracy: 0.8511
Epoch 101/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0846 - accuracy: 0.9453 - val_loss: 0.3645 - val_accuracy: 0.8617
Epoch 102/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0799 - accuracy: 0.9559 - val_loss: 0.3840 - val_accuracy: 0.8351
Epoch 103/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0794 - accuracy: 0.9533 - val_loss: 0.3744 - val_accuracy: 0.8511
Epoch 104/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0774 - accuracy: 0.9613 - val_loss: 0.3882 - val_accuracy: 0.8511
Epoch 105/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0875 - accuracy: 0.9453 - val_loss: 0.3565 - val_accuracy: 0.8564
Epoch 106/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0733 - accuracy: 0.9586 - val_loss: 0.4019 - val_accuracy: 0.8404
Epoch 107/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0769 - accuracy: 0.9546 - val_loss: 0.3803 - val_accuracy: 0.8404
Epoch 108/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0804 - accuracy: 0.9493 - val_loss: 0.3884 - val_accuracy: 0.8617
Epoch 109/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0748 - accuracy: 0.9573 - val_loss: 0.3719 - val_accuracy: 0.8511
Epoch 110/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0668 - accuracy: 0.9680 - val_loss: 0.3868 - val_accuracy: 0.8511
Epoch 111/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0733 - accuracy: 0.9613 - val_loss: 0.3954 - val_accuracy: 0.8511
Epoch 112/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0757 - accuracy: 0.9613 - val_loss: 0.3807 - val_accuracy: 0.8457
Epoch 113/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0714 - accuracy: 0.9626 - val_loss: 0.3795 - val_accuracy: 0.8617
Epoch 114/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0703 - accuracy: 0.9613 - val_loss: 0.4027 - val_accuracy: 0.8511
Epoch 115/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0679 - accuracy: 0.9613 - val_loss: 0.3945 - val_accuracy: 0.8511
Epoch 116/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0688 - accuracy: 0.9640 - val_loss: 0.4025 - val_accuracy: 0.8511
Epoch 117/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0660 - accuracy: 0.9640 - val_loss: 0.3848 - val_accuracy: 0.8617
Epoch 118/200
24/24 [==============================] - 0s 8ms/step - loss: 0.0664 - accuracy: 0.9640 - val_loss: 0.4085 - val_accuracy: 0.8457
Epoch 119/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0752 - accuracy: 0.9640 - val_loss: 0.3679 - val_accuracy: 0.8511
Epoch 120/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0589 - accuracy: 0.9720 - val_loss: 0.4496 - val_accuracy: 0.8564
Epoch 121/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0653 - accuracy: 0.9519 - val_loss: 0.4104 - val_accuracy: 0.8457
Epoch 122/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0689 - accuracy: 0.9613 - val_loss: 0.4214 - val_accuracy: 0.8457
Epoch 123/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0714 - accuracy: 0.9626 - val_loss: 0.4218 - val_accuracy: 0.8404
Epoch 124/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0583 - accuracy: 0.9653 - val_loss: 0.4413 - val_accuracy: 0.8404
Epoch 125/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0688 - accuracy: 0.9533 - val_loss: 0.3954 - val_accuracy: 0.8511
Epoch 126/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0567 - accuracy: 0.9666 - val_loss: 0.4431 - val_accuracy: 0.8457
Epoch 127/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0591 - accuracy: 0.9640 - val_loss: 0.4677 - val_accuracy: 0.8511
Epoch 128/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0530 - accuracy: 0.9813 - val_loss: 0.4643 - val_accuracy: 0.8511
Epoch 129/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0622 - accuracy: 0.9653 - val_loss: 0.4464 - val_accuracy: 0.8511
Epoch 130/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0622 - accuracy: 0.9626 - val_loss: 0.4539 - val_accuracy: 0.8351
Epoch 131/200
24/24 [==============================] - 0s 10ms/step - loss: 0.0566 - accuracy: 0.9653 - val_loss: 0.4532 - val_accuracy: 0.8404
Epoch 132/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0484 - accuracy: 0.9813 - val_loss: 0.4751 - val_accuracy: 0.8511
Epoch 133/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0565 - accuracy: 0.9680 - val_loss: 0.4522 - val_accuracy: 0.8564
Epoch 134/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0609 - accuracy: 0.9680 - val_loss: 0.4567 - val_accuracy: 0.8617
Epoch 135/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0623 - accuracy: 0.9693 - val_loss: 0.4747 - val_accuracy: 0.8617
Epoch 136/200
24/24 [==============================] - 0s 7ms/step - loss: 0.0622 - accuracy: 0.9626 - val_loss: 0.4601 - val_accuracy: 0.8404
Epoch 137/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0652 - accuracy: 0.9626 - val_loss: 0.4516 - val_accuracy: 0.8511
Epoch 138/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0501 - accuracy: 0.9760 - val_loss: 0.4466 - val_accuracy: 0.8457
Epoch 139/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0509 - accuracy: 0.9666 - val_loss: 0.4612 - val_accuracy: 0.8511
Epoch 140/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0453 - accuracy: 0.9773 - val_loss: 0.5029 - val_accuracy: 0.8457
Epoch 141/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0502 - accuracy: 0.9746 - val_loss: 0.4703 - val_accuracy: 0.8404
Epoch 142/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0662 - accuracy: 0.9599 - val_loss: 0.4761 - val_accuracy: 0.8511
Epoch 143/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0621 - accuracy: 0.9693 - val_loss: 0.4833 - val_accuracy: 0.8511
Epoch 144/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0529 - accuracy: 0.9746 - val_loss: 0.4752 - val_accuracy: 0.8511
Epoch 145/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0523 - accuracy: 0.9786 - val_loss: 0.4586 - val_accuracy: 0.8511
Epoch 146/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0693 - accuracy: 0.9613 - val_loss: 0.4512 - val_accuracy: 0.8617
Epoch 147/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0522 - accuracy: 0.9666 - val_loss: 0.4818 - val_accuracy: 0.8511
Epoch 148/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0493 - accuracy: 0.9706 - val_loss: 0.4843 - val_accuracy: 0.8457
Epoch 149/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0544 - accuracy: 0.9720 - val_loss: 0.4744 - val_accuracy: 0.8511
Epoch 150/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0521 - accuracy: 0.9706 - val_loss: 0.4754 - val_accuracy: 0.8511
Epoch 151/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0515 - accuracy: 0.9733 - val_loss: 0.4807 - val_accuracy: 0.8457
Epoch 152/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0431 - accuracy: 0.9773 - val_loss: 0.4835 - val_accuracy: 0.8457
Epoch 153/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0559 - accuracy: 0.9720 - val_loss: 0.5238 - val_accuracy: 0.8511
Epoch 154/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0487 - accuracy: 0.9706 - val_loss: 0.4911 - val_accuracy: 0.8511
Epoch 155/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0423 - accuracy: 0.9786 - val_loss: 0.5085 - val_accuracy: 0.8457
Epoch 156/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0381 - accuracy: 0.9786 - val_loss: 0.5106 - val_accuracy: 0.8351
Epoch 157/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0576 - accuracy: 0.9693 - val_loss: 0.4888 - val_accuracy: 0.8457
Epoch 158/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0562 - accuracy: 0.9666 - val_loss: 0.5031 - val_accuracy: 0.8511
Epoch 159/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0469 - accuracy: 0.9706 - val_loss: 0.5013 - val_accuracy: 0.8404
Epoch 160/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0524 - accuracy: 0.9746 - val_loss: 0.4899 - val_accuracy: 0.8351
Epoch 161/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0546 - accuracy: 0.9760 - val_loss: 0.5110 - val_accuracy: 0.8404
Epoch 162/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0463 - accuracy: 0.9746 - val_loss: 0.5447 - val_accuracy: 0.8457
Epoch 163/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0538 - accuracy: 0.9733 - val_loss: 0.5167 - val_accuracy: 0.8457
Epoch 164/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0439 - accuracy: 0.9733 - val_loss: 0.5350 - val_accuracy: 0.8351
Epoch 165/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0500 - accuracy: 0.9746 - val_loss: 0.5514 - val_accuracy: 0.8404
Epoch 166/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0536 - accuracy: 0.9733 - val_loss: 0.5438 - val_accuracy: 0.8351
Epoch 167/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0682 - accuracy: 0.9586 - val_loss: 0.4813 - val_accuracy: 0.8457
Epoch 168/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0470 - accuracy: 0.9760 - val_loss: 0.5017 - val_accuracy: 0.8404
Epoch 169/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0658 - accuracy: 0.9666 - val_loss: 0.5115 - val_accuracy: 0.8564
Epoch 170/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0520 - accuracy: 0.9733 - val_loss: 0.4946 - val_accuracy: 0.8457
Epoch 171/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0497 - accuracy: 0.9693 - val_loss: 0.4869 - val_accuracy: 0.8404
Epoch 172/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0318 - accuracy: 0.9866 - val_loss: 0.5245 - val_accuracy: 0.8457
Epoch 173/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0451 - accuracy: 0.9800 - val_loss: 0.5289 - val_accuracy: 0.8511
Epoch 174/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0417 - accuracy: 0.9746 - val_loss: 0.5406 - val_accuracy: 0.8298
Epoch 175/200
24/24 [==============================] - 0s 11ms/step - loss: 0.0361 - accuracy: 0.9826 - val_loss: 0.5379 - val_accuracy: 0.8404
Epoch 176/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0480 - accuracy: 0.9720 - val_loss: 0.5727 - val_accuracy: 0.8457
Epoch 177/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0493 - accuracy: 0.9760 - val_loss: 0.5670 - val_accuracy: 0.8404
Epoch 178/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0520 - accuracy: 0.9666 - val_loss: 0.5355 - val_accuracy: 0.8457
Epoch 179/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0529 - accuracy: 0.9760 - val_loss: 0.5426 - val_accuracy: 0.8457
Epoch 180/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0444 - accuracy: 0.9720 - val_loss: 0.5398 - val_accuracy: 0.8404
Epoch 181/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0405 - accuracy: 0.9800 - val_loss: 0.4989 - val_accuracy: 0.8404
Epoch 182/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0400 - accuracy: 0.9733 - val_loss: 0.4882 - val_accuracy: 0.8511
Epoch 183/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0395 - accuracy: 0.9813 - val_loss: 0.4955 - val_accuracy: 0.8457
Epoch 184/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0331 - accuracy: 0.9800 - val_loss: 0.5274 - val_accuracy: 0.8404
Epoch 185/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0424 - accuracy: 0.9760 - val_loss: 0.5383 - val_accuracy: 0.8457
Epoch 186/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0415 - accuracy: 0.9813 - val_loss: 0.5761 - val_accuracy: 0.8457
Epoch 187/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0419 - accuracy: 0.9733 - val_loss: 0.4987 - val_accuracy: 0.8457
Epoch 188/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0318 - accuracy: 0.9813 - val_loss: 0.5374 - val_accuracy: 0.8457
Epoch 189/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0292 - accuracy: 0.9840 - val_loss: 0.5564 - val_accuracy: 0.8457
Epoch 190/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0341 - accuracy: 0.9866 - val_loss: 0.5952 - val_accuracy: 0.8511
Epoch 191/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0365 - accuracy: 0.9773 - val_loss: 0.5923 - val_accuracy: 0.8511
Epoch 192/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0335 - accuracy: 0.9826 - val_loss: 0.5971 - val_accuracy: 0.8457
Epoch 193/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0468 - accuracy: 0.9706 - val_loss: 0.5978 - val_accuracy: 0.8404
Epoch 194/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0493 - accuracy: 0.9746 - val_loss: 0.5824 - val_accuracy: 0.8457
Epoch 195/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0572 - accuracy: 0.9733 - val_loss: 0.5625 - val_accuracy: 0.8404
Epoch 196/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0313 - accuracy: 0.9760 - val_loss: 0.5672 - val_accuracy: 0.8404
Epoch 197/200
24/24 [==============================] - 0s 7ms/step - loss: 0.0382 - accuracy: 0.9826 - val_loss: 0.5987 - val_accuracy: 0.8511
Epoch 198/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0507 - accuracy: 0.9706 - val_loss: 0.5544 - val_accuracy: 0.8404
Epoch 199/200
24/24 [==============================] - 0s 6ms/step - loss: 0.0530 - accuracy: 0.9706 - val_loss: 0.5341 - val_accuracy: 0.8457
Epoch 200/200
24/24 [==============================] - 0s 5ms/step - loss: 0.0394 - accuracy: 0.9800 - val_loss: 0.5422 - val_accuracy: 0.8351
8/8 [==============================] - 0s 2ms/step - loss: 0.5707 - accuracy: 0.8511
Test accuracy: 0.8510638475418091
In [68]:
# Make predictions on 2023 data
X_2023 = df_2023[["FG", "FGA", "FG%", "3P", "3P%", "2P", "2P%", "FT",
      "FTA", "FT%", "AST", "STL", "BLK", "TOV", "PF", "PTS", "Year",
     "W", "L", "OR", "DR"]]
scaler = StandardScaler()
y_pred_2023 = model.predict(scaler.fit_transform(X_2023))

# Add the predicted winning status probabilities to the dataframe
df_2023['WINNING_STATUS_prob_0'] = y_pred_2023[:, 0]
df_2023['WINNING_STATUS_prob_1'] = y_pred_2023[:, 1]
df_2023['WINNING_STATUS_prob_2'] = y_pred_2023[:, 2]

# Sort the teams by predicted winning status probability in descending order for each category
df_2023 = df_2023.sort_values(by=['WINNING_STATUS_prob_2'], ascending=False)

best_team = df_2023.iloc[0]["Team"]
secondbest_team = df_2023.iloc[1]["Team"]
print('The team with the HIGHEST predicted probability of winning the championship in 2023 is: ' + str(best_team))
print('The team with the second highest predicted probability is: ' + str(secondbest_team))
1/1 [==============================] - 0s 35ms/step
The team with the HIGHEST predicted probability of winning the championship in 2023 is: Milwaukee Bucks
The team with the second highest predicted probability is: Memphis Grizzlies

As we can see, the Neural Network predicts the Milwaukee Bucks to win the 2023 NBA Finals. Now let's move onto the conclusion to recap everything we have just done.

Insight and Conclusions

This is the 5th and final step of the data science lifecycle where we use our data explorations and models to make a prediction as to who will win the 2023 NBA Finals.

We were able to figure out some key details by exploring and modeling our data:

  1. The features that are the most applicable when modeling are "FG", "3P", "2P", "FT%", "AST", "STL", "W", "L" because these are the features used in the Logistic Regression which has the highest accuracy.
  2. 4/5 models that use these key features predict that the Boston Celtics will win the 2023 NBA Championship.
By using these key details, Our analysis of the NBA teams predicts that the Boston Celtics are the favorites to win the NBA champions. Given the narration, and how they are playing this makes sense and is justifiable. The one model of ours that contradicted this is our custom neural network, which predicted that the Milwaukee Bucks would win the Finals. Before the season started, the Milwaukee Bucks were the favorites but unfortunately lost in the first round due to injury and other circumstances. These unexpected changes couldn't be accounted for properly in our model, but this project has shown, that by the numbers, the Boston Celtics are favorites to win the NBA championship!

For further research, we can consider the various eras in NBA history and how playing styles could affect the importance of each of the predictors. We could also try using more defensive statistics and see if they play a factor.

To recap, the data science lifecycle is data collection, data processing, exploratory analysis, machine learning, and finally drawing insights and conclusions. We hope that this tutorial serves as an example for your own projects in the future!